Targeted Gene Metagenomic Data Analysis ◾ 259
7.2.3.2 VSEARCH
VSEARCH [12] is another alignment-based and open-source tool like BLAST. However,
unlike BLAST, which uses seed-based heuristic search algorithm, it uses a fast heuris-
tic search by identifying a small set of database sequences that have many k-mer words
in common with the query sequence (a representative sequence). VSEARCH algorithm
counts the number of shared words between a representative sequence and each data-
base sequence (a word will count once if it appears multiple times). Thus, the similarity
between a representative sequence and a target sequence is based on the statistics of shared
words rather than alignment. Then, the algorithm performs pairwise global alignment
(Needleman-Wunsch) for the query sequence with each sequence of the database begin-
ning with the sequences with the largest number of words in common with the representa-
tive sequence. The optimal global alignment score is computed for each alignment. Unlike
BLAST, exhaustive pairwise alignment is computationally expensive. However, VSEARCH
employs parallel strategies by using vectorization and multiple threading that reduce the
computational cost.
7.2.3.3 Ribosomal Database Project
Ribosomal Database Project (RDP) [13] Classifier is a machine learning taxonomy classifi-
cation method. It is a model trained by known reference sequences with known taxonomic
groups. After training, the classifier is used to predict the taxonomy of the representative
sequences. RDP uses the naïve Bayesian algorithm to accurately predict the taxonomy of a
sequence. The model is trained by N sequences with known taxonomy. First, the algorithm
forms a dictionary from all possible 8-base subsequences or words from all sequences. The
words of all N sequences are called corpus. Let W = {w1, w2, …, wd} be the set of all possible
eight-character words in the corpus and n(wi) be the number of sequences containing word
wi, where i=1, 2, …, d; thus, the expected-likelihood estimate of observing word wi in a
sequence is calculated as
P
n w
N
i
i
[
]
(
)
(
)
=
+
+
0.5
1
(7.5)
The numbers 0.5 and 1 are used to keep the value of Pi to be between 0 and 1 (e.g., 0 Pi <1).
Assume that m(wi) is the number of the word wi contained by M training sequences
whose genus is G. Thus, the conditional probability that a sequence of genus G contains the
word wi is estimated as:
P w G
m w
P
M
i
i
i
[
]
(
)
(
)
=
+
+
|
1
(7.6)
Now assume that a sequence S of genus G has the observed set of words V={v1, v2, …, vf}
V⊑W; thus, the joint probability or likelihood of observing the sequence S is the product of
the probability of each word given genus G as